Creating word-level language models for large-vocabulary handwriting recognition
Internal identifier: 007B68 (Main/Exploration); previous: 007B67; next: 007B69
Authors: John F. Pitrelli [United States]; Amit Roy [United States]
Source:
- International journal on document analysis and recognition : (Print) [ 1433-2833 ] ; 2003.
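French descriptors
- Pascal (Inist): Reconnaissance écriture, Reconnaissance forme, Reconnaissance langage, Reconnaissance caractère, Analyse syntaxique, Unigram, Tokenization, Word-level language model, Jeton
English descriptors
- KwdEn: Character recognition, Handwriting recognition, Language recognition, Pattern recognition, Syntactic analysis, Token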
Abstract
We discuss development of a word-unigram language model for online handwriting recognition. First, we tokenize a text corpus into words, contrasting with tokenization methods designed for other purposes. Second, we select for our model a subset of the words found, discussing deviations from an N-most-frequent-words approach. From a 600-million-word corpus, we generated a 53,000-word model which eliminates 45% of word-recognition errors made by a character-level-model baseline system. We anticipate that our methods will be applicable to offline recognition as well, and to some extent to other recognizers, such as speech recognizers and video retrieval systems.
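As a rough illustration of the two steps named in the abstract (tokenize a corpus, then keep a subset of the words), the following minimal sketch builds a word-unigram model using the plain N-most-frequent-words baseline. It is not the authors' method: the abstract states that they deviate from this baseline but does not say how. The tokenizer rule, the function names (`tokenize`, `build_unigram_model`), and the toy corpus are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical tokenizer: lowercase alphabetic runs, allowing internal
# apostrophes (e.g. "didn't"). The paper's actual tokenization rules are
# not given in the abstract.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")


def tokenize(text):
    """Split text into lowercase word tokens."""
    return TOKEN_RE.findall(text.lower())


def build_unigram_model(texts, vocab_size):
    """Keep the vocab_size most frequent words and return their
    relative frequencies, renormalized over the kept vocabulary."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    top = counts.most_common(vocab_size)
    total = sum(count for _, count in top)
    return {word: count / total for word, count in top}


# Toy corpus, purely illustrative.
corpus = ["The cat sat on the mat.", "The dog didn't sit on the mat."]
model = build_unigram_model(corpus, vocab_size=3)
```

A recognizer could use such probabilities to rank candidate words for a handwritten input; how the paper's system combines them with the recognizer's own scores is not described in the abstract.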
Affiliations:
- IBM T.J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY 10598, USA (both authors)
Links toward previous steps (curation, corpus...)
- to stream PascalFrancis, to step Corpus: 000772
- to stream PascalFrancis, to step Curation: 000271
- to stream PascalFrancis, to step Checkpoint: 000715
- to stream Main, to step Merge: 007F74
- to stream Main, to step Curation: 007B68
The document in XML format
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en" level="a">Creating word-level language models for large-vocabulary handwriting recognition</title>
<author><name sortKey="Pitrelli, John F" sort="Pitrelli, John F" uniqKey="Pitrelli J" first="John F." last="Pitrelli">John F. Pitrelli</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>IBM T.J. Watson Research Center, P.O. Box 218</s1>
<s2>Yorktown Heights, NY 10598</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<wicri:noRegion>Yorktown Heights, NY 10598</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Roy, Amit" sort="Roy, Amit" uniqKey="Roy A" first="Amit" last="Roy">Amit Roy</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>IBM T.J. Watson Research Center, P.O. Box 218</s1>
<s2>Yorktown Heights, NY 10598</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<wicri:noRegion>Yorktown Heights, NY 10598</wicri:noRegion>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">INIST</idno>
<idno type="inist">03-0386044</idno>
<date when="2003">2003</date>
<idno type="stanalyst">PASCAL 03-0386044 INIST</idno>
<idno type="RBID">Pascal:03-0386044</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000772</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000271</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000715</idno>
<idno type="wicri:explorRef" wicri:stream="PascalFrancis" wicri:step="Checkpoint">000715</idno>
<idno type="wicri:doubleKey">1433-2833:2003:Pitrelli J:creating:word:level</idno>
<idno type="wicri:Area/Main/Merge">007F74</idno>
<idno type="wicri:Area/Main/Curation">007B68</idno>
<idno type="wicri:Area/Main/Exploration">007B68</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a">Creating word-level language models for large-vocabulary handwriting recognition</title>
<author><name sortKey="Pitrelli, John F" sort="Pitrelli, John F" uniqKey="Pitrelli J" first="John F." last="Pitrelli">John F. Pitrelli</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>IBM T.J. Watson Research Center, P.O. Box 218</s1>
<s2>Yorktown Heights, NY 10598</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<wicri:noRegion>Yorktown Heights, NY 10598</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Roy, Amit" sort="Roy, Amit" uniqKey="Roy A" first="Amit" last="Roy">Amit Roy</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>IBM T.J. Watson Research Center, P.O. Box 218</s1>
<s2>Yorktown Heights, NY 10598</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<wicri:noRegion>Yorktown Heights, NY 10598</wicri:noRegion>
</affiliation>
</author>
</analytic>
<series><title level="j" type="main">International journal on document analysis and recognition : (Print)</title>
<title level="j" type="abbreviated">Int. j. doc. anal. recognit. : (Print)</title>
<idno type="ISSN">1433-2833</idno>
<imprint><date when="2003">2003</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt><title level="j" type="main">International journal on document analysis and recognition : (Print)</title>
<title level="j" type="abbreviated">Int. j. doc. anal. recognit. : (Print)</title>
<idno type="ISSN">1433-2833</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Character recognition</term>
<term>Handwriting recognition</term>
<term>Language recognition</term>
<term>Pattern recognition</term>
<term>Syntactic analysis</term>
<term>Token</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Reconnaissance écriture</term>
<term>Reconnaissance forme</term>
<term>Reconnaissance langage</term>
<term>Reconnaissance caractère</term>
<term>Analyse syntaxique</term>
<term>Unigram</term>
<term>Tokenization</term>
<term>Word-level language model</term>
<term>Jeton</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">We discuss development of a word-unigram language model for online handwriting recognition. First, we tokenize a text corpus into words, contrasting with tokenization methods designed for other purposes. Second, we select for our model a subset of the words found, discussing deviations from an N-most-frequent-words approach. From a 600-million-word corpus, we generated a 53,000-word model which eliminates 45% of word-recognition errors made by a character-level-model baseline system. We anticipate that our methods will be applicable to offline recognition as well, and to some extent to other recognizers, such as speech recognizers and video retrieval systems.</div>
</front>
</TEI>
<affiliations><list><country><li>États-Unis</li>
</country>
</list>
<tree><country name="États-Unis"><noRegion><name sortKey="Pitrelli, John F" sort="Pitrelli, John F" uniqKey="Pitrelli J" first="John F." last="Pitrelli">John F. Pitrelli</name>
</noRegion>
<name sortKey="Roy, Amit" sort="Roy, Amit" uniqKey="Roy A" first="Amit" last="Roy">Amit Roy</name>
</country>
</tree>
</affiliations>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Wicri/Lorraine/explor/InforLorV4/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 007B68 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 007B68 | SxmlIndent | more
To add a link to this page in the Wicri network
{{Explor lien |wiki= Wicri/Lorraine |area= InforLorV4 |flux= Main |étape= Exploration |type= RBID |clé= Pascal:03-0386044 |texte= Creating word-level language models for large-vocabulary handwriting recognition }}
This area was generated with Dilib version V0.6.33.